18 research outputs found

    Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance

    Full text link
    Offline reinforcement learning (RL) optimizes the policy on a previously collected dataset without any interactions with the environment, yet usually suffers from the distributional shift problem. To mitigate this issue, a typical solution is to impose a policy constraint on a policy improvement objective. However, existing methods generally adopt a "one-size-fits-all" practice, i.e., keeping only a single improvement-constraint balance for all the samples in a mini-batch or even the entire offline dataset. In this work, we argue that different samples should be treated with different policy constraint intensities. Based on this idea, a novel plug-in approach named Guided Offline RL (GORL) is proposed. GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample. We theoretically prove that the guidance provided by our method is rational and near-optimal. Extensive experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
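    The per-sample trade-off the abstract describes can be sketched generically. This is an illustration of a weighted improvement-versus-constraint objective under assumed inputs, not GORL's actual loss; the function name, the inputs, and the weights are all hypothetical, and in GORL the weights would come from the guiding network rather than being hand-set:

```python
import numpy as np

def weighted_offline_loss(q_values, bc_errors, weights):
    """Per-sample balance between policy improvement (maximize Q)
    and policy constraint (minimize behavior-cloning error).
    weights[i] near 1 favors improvement for sample i; near 0
    favors the constraint. All names here are illustrative."""
    return np.mean(-weights * q_values + (1.0 - weights) * bc_errors)

# A fixed balance (the "one-size-fits-all" practice) uses the same
# weight for every sample; an adaptive scheme varies it per sample.
q = np.array([1.0, 2.0, 0.5])     # hypothetical Q-value estimates
bc = np.array([0.2, 0.1, 0.9])    # hypothetical constraint penalties
fixed = weighted_offline_loss(q, bc, np.full(3, 0.5))
adaptive = weighted_offline_loss(q, bc, np.array([0.9, 0.8, 0.1]))
```

    The adaptive weighting pushes improvement on samples the guidance trusts and falls back to the constraint elsewhere, which is the behavior the abstract argues for.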

    Boosting Offline Reinforcement Learning with Action Preference Query

    Full text link
    Training practical agents usually involves offline and online reinforcement learning (RL) to balance the policy's performance and interaction costs. In particular, online fine-tuning has become a commonly used method to correct the erroneous estimates of out-of-distribution data learned in the offline training phase. However, even limited online interactions can be inaccessible or catastrophic for high-stakes scenarios like healthcare and autonomous driving. In this work, we introduce an interaction-free training scheme dubbed Offline-with-Action-Preferences (OAP). The main insight is that, compared to online fine-tuning, querying the preferences between pre-collected and learned actions can be equally or even more helpful to the erroneous estimate problem. By adaptively encouraging or suppressing the policy constraint according to action preferences, OAP can distinguish overestimation from beneficial policy improvement and thus attains a more accurate evaluation of unseen data. Theoretically, we prove a lower bound on the behavior policy's performance improvement brought by OAP. Moreover, comprehensive experiments on the D4RL benchmark and state-of-the-art algorithms demonstrate that OAP yields higher (29% on average) scores, especially on challenging AntMaze tasks (98% higher). Comment: International Conference on Machine Learning 2023

    Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation

    Full text link
    Recent breakthroughs in large language models (LLMs) have brought remarkable success in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is that the information processed by LLMs is consistently honest, neglecting the pervasive deceptive or misleading information in human society and AI-generated content. This oversight makes LLMs susceptible to malicious manipulations, potentially resulting in detrimental outcomes. This study utilizes the intricate Avalon game as a testbed to explore LLMs' potential in deceptive environments. Avalon, full of misinformation and requiring sophisticated logic, manifests as a "Game-of-Thoughts". Inspired by the efficacy of humans' recursive thinking and perspective-taking in the Avalon game, we introduce a novel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to identify and counteract deceptive information. ReCon combines formulation and refinement contemplation processes; formulation contemplation produces initial thoughts and speech, while refinement contemplation further polishes them. Additionally, we incorporate first-order and second-order perspective transitions into these processes respectively. Specifically, the first-order allows an LLM agent to infer others' mental states, and the second-order involves understanding how others perceive the agent's mental state. After integrating ReCon with different LLMs, extensive experimental results from the Avalon game indicate its efficacy in aiding LLMs to discern and maneuver around deceptive information without extra fine-tuning and data. Finally, we offer a possible explanation for the efficacy of ReCon and explore the current limitations of LLMs in terms of safety, reasoning, speaking style, and format, potentially furnishing insights for subsequent research. Comment: 40 pages

    Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning

    Full text link
    Offline-to-online reinforcement learning (RL) is a training paradigm that combines pre-training on a pre-collected dataset with fine-tuning in an online environment. However, the incorporation of online fine-tuning can intensify the well-known distributional shift problem. Existing solutions tackle this problem by imposing a policy constraint on the policy improvement objective in both offline and online learning. They typically advocate a single balance between policy improvement and constraints across diverse data collections. This one-size-fits-all manner may not optimally leverage each collected sample due to the significant variation in data quality across different states. To this end, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances. FamO2O utilizes a universal model to train a family of policies with different improvement/constraint intensities, and a balance model to select a suitable policy for each state. Theoretically, we prove that state-adaptive balances are necessary for achieving a higher policy performance upper bound. Empirically, extensive experiments show that FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark. Code is available at https://github.com/LeapLabTHU/FamO2O. Comment: NeurIPS 2023 spotlight. 24 pages, 13 figures

    Multiple influence of immune cells in the bone metastatic cancer microenvironment on tumors

    Get PDF
    Bone is a common organ for solid tumor metastasis. Malignant bone tumors become insensitive to systemic therapy after colonization, followed by poor prognosis and a high relapse rate. Immune and bone cells in situ constitute a unique immune microenvironment, which plays a crucial role in the context of bone metastasis. This review first focuses on lymphatic cells in bone metastatic cancer, including their function in tumor dissemination, invasion, growth and possible cytotoxicity-induced eradication. Subsequently, we examine myeloid cells, namely macrophages, myeloid-derived suppressor cells, dendritic cells, and megakaryocytes, evaluating their interaction with cytotoxic T lymphocytes and contribution to bone metastasis. As important components of skeletal tissue, osteoclasts and osteoblasts derived from bone marrow stromal cells, engaging in a 'vicious cycle', accelerate osteolytic bone metastasis. We also explain the concept of tumor dormancy and investigate the underlying role of the immune microenvironment in it. Additionally, a thorough review of emerging treatments for bone metastatic malignancy in clinical research, especially immunotherapy, is presented, indicating current challenges and opportunities in the research and development of bone metastasis therapies.

    Field monitoring and numerical analysis on piled-raft foundation : case study

    No full text
    This thesis presents the results of a detailed back-analysis, using three-dimensional finite-element analysis, of an instrumented piled-raft foundation at a monitoring site. The piled-raft foundation is a composite foundation structure consisting of piles, raft and surrounding soils acting as a whole system. To check the reliability of the soil taking load under the raft and to obtain a reasonable value of the load proportion taken by piles for the soil conditions in Hong Kong, a piled-raft foundation was partially instrumented at the monitoring site. The pile head loading, the raft-soil contact pressure over specified areas and the settlement at the raft top at selected locations were monitored. Comparisons of overall settlement, differential settlements and the load carried by the piles show reasonably good agreement. This is followed by 3D finite-element modeling of the entire piled-raft foundation at the monitored site; the analysis includes a pile-soil slip interface model. The numerical analysis is performed to give insights into (1) the load transfer behavior of the piled-raft foundation and (2) the effects of pile reduction on the pile load ratio. Combining the observations from site monitoring and the results of the numerical analysis, the proportion of load shared between piles and raft is revealed to be 7:3. A lower limit of 0.67 for the pile ratio is proposed for the site after a parametric study in which piles were removed strategically. In spite of the settlement-reducing purpose of the piles, the design of piled-raft foundations still concentrates on providing adequate axial capacity, with the settlement requirement treated as a secondary issue. The significance of the study is that it provides factual evidence of the soil taking load under the raft, and the economic benefit of the piled-raft foundation, as a reduction in piles will save more than 2 million of the construction budget. published_or_final_version. Civil Engineering. Master of Philosophy

    Tiling a strip with triangles

    No full text
    Abstract: In this paper, we introduce the tilings of a 2×n "triangular strip" with triangles. These tilings have connections with Fibonacci numbers, Pell numbers, and other known sequences. We derive several different recurrences, establish some properties of these numbers, give a refined count for these tilings (i.e., by the number and type of triangles used), and establish several properties of these refined counts.
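    The abstract does not state the paper's actual tiling recurrences, but the sequences it names are both second-order linear recurrences, which can be sketched with one generic generator (the function name and initial terms shown are illustrative):

```python
def linear_recurrence(a, b, coeffs, n):
    """First n terms of x[k] = c1*x[k-1] + c2*x[k-2],
    starting from the two initial terms a and b."""
    c1, c2 = coeffs
    seq = [a, b]
    while len(seq) < n:
        seq.append(c1 * seq[-1] + c2 * seq[-2])
    return seq[:n]

# Fibonacci: F(k) = F(k-1) + F(k-2)
fib = linear_recurrence(1, 1, (1, 1), 8)   # 1, 1, 2, 3, 5, 8, 13, 21
# Pell: P(k) = 2*P(k-1) + P(k-2)
pell = linear_recurrence(1, 2, (2, 1), 8)  # 1, 2, 5, 12, 29, 70, 169, 408
```

    Refined tiling counts of the kind the abstract mentions are typically obtained by carrying extra state (number and type of triangles used) through a recurrence of this same shape.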

    Effectiveness and safety of 99Tc-methylene diphosphonate as a disease-modifying anti-rheumatic drug (DMARD) in combination with conventional synthetic (cs) DMARDs in the treatment of rheumatoid arthritis: A systematic review and meta-analysis of 34 randomized controlled trials

    No full text
    Background: Technetium [99Tc] methylene diphosphonate injection (99Tc-MDP) is widely used for the treatment of rheumatoid arthritis (RA), but there is still insufficient evidence for its application. Through meta-analysis and systematic review, this study aimed to evaluate the effectiveness and safety of 99Tc-MDP in combination with conventional synthetic disease-modifying anti-rheumatic drugs (csDMARDs) for RA. Methods: This study was registered on PROSPERO in advance (CRD42021220780). A systematic search was conducted in PubMed, Embase, the Cochrane Library, and multiple international public databases from their inception to April 2023 to identify clinical randomized controlled trials exploring the use of 99Tc-MDP combined with csDMARDs in the treatment of RA. Each outcome was subjected to meta-analysis, and the quality of evidence was assessed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. The American College of Rheumatology's 50%/70% response criteria (ACR50/70) scores were utilized as the primary effectiveness outcomes, and risks were measured by assessing the rates of adverse events (AEs). Moreover, secondary efficacy outcomes were evaluated, including the Disease Activity Score 28 (DAS28) and bone mineral density (BMD) as joint function indicators, and the erythrocyte sedimentation rate (ESR) and interleukin-17 (IL-17) as inflammatory indicators. Results: In this meta-analysis, a total of 34 studies (2296 patients) were included out of 1149 retrieved studies. The summarized results showed that the treatment group receiving the combination of 99Tc-MDP and csDMARDs had significantly higher ACR50 (RR = 1.32, 95% CI: 1.13–1.55, P = 0.0004) and ACR70 (RR = 1.40, 95% CI: 1.07–1.82, P = 0.01) scores than the control group receiving csDMARDs alone. In addition, the overall incidence of AEs was lower with the combination of 99Tc-MDP and csDMARDs than with csDMARDs alone (RR = 0.75, 95% CI: 0.60–0.93, P = 0.009), but the possibility of phlebitis was higher in the treatment group (RR = 4.15, 95% CI: 1.04–16.50, P = 0.04). The combination of 99Tc-MDP and csDMARDs also had advantages over csDMARDs alone in improving DAS28 (WMD = 1.56, 95% CI: 0.86–2.25, P < 0.0001), BMD (SMD = 1.12, 95% CI: 0.46–1.78, P = 0.0008), ESR (SMD = 0.71, 95% CI: 0.45–0.97, P < 0.00001), and IL-17 (WMD = 5.82, 95% CI: 3.86–7.77, P < 0.00001). However, the above results might have been influenced by the 99Tc-MDP dosage, csDMARD category, and treatment duration. Combining methotrexate and leflunomide, administering continuous treatment for 24 weeks, or using 3 sets of 99Tc-MDP doses (16.5 mg) may be the optimal 99Tc-MDP treatment plan for RA. Conclusion: Compared with csDMARD therapy alone, the combination therapy with 99Tc-MDP is more effective for RA patients and is associated with a lower overall incidence of adverse events, although the possibility of phlebitis was higher. However, due to the inherent limitations of the included RCTs, high-quality clinical trials are still needed to further assess the effectiveness and safety of this combination therapy.
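    The pooled effect sizes quoted in this abstract (e.g. RR = 1.32, 95% CI 1.13–1.55) rest on the standard log risk-ratio method for a single study's 2x2 table; a minimal sketch (the event counts below are hypothetical, not taken from the review's data):

```python
import math

def risk_ratio_ci(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Risk ratio with a z-based confidence interval, using the
    standard log-RR standard error for a 2x2 table."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    se = math.sqrt(1/events_trt - 1/n_trt + 1/events_ctl - 1/n_ctl)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical counts: 30/100 responders on combination therapy
# vs. 20/100 on csDMARDs alone.
rr, lo, hi = risk_ratio_ci(30, 100, 20, 100)
```

    A meta-analysis then pools such per-study log-RRs (e.g. by inverse-variance weighting) to obtain the summary RR and CI reported above.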

    Space–Time Analysis of Vehicle Theft Patterns in Shanghai, China

    No full text
    To identify and compare the space–time patterns of vehicle thefts and the effects of associated environmental factors, this paper conducts a case study of the Pudong New Area (PNA), a major urban district in Shanghai, China's largest city. Geographic information system (GIS)-based analysis indicated that there was a stable pattern of vehicle theft over time. Hotspots of vehicle theft across different time periods were identified. These data provide clues for how law enforcement can prioritize the deployment of limited patrol and investigative resources. Vehicle thefts, especially those of non-motor vehicles, tend to be concentrated in the central-western portion of the PNA, which has experienced a dramatic rate of urbanization and has a high concentration of people and vehicles. Important factors contributing to vehicle thefts include a highly mobile and transitory population, high population density, and high traffic volume.

    On the spectral sidebands' evolution of mode-locked fiber lasers

    No full text
    Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 62275060) and the Natural Science Foundation of Heilongjiang Province (Grant Nos. LH2023F029 and LH2019F012). Peer reviewed. Postprint